
\def\graphwidth{0.9\linewidth}

The first experiment was to tune in the value of the compiler parameters.
 A machine learning algorithm was used to find the sweetspot for a set of
 inlining parameters. As described in Section~\ref{sec:ml}, we used
 SPSA - Simultaneous Perturbation Stochastic Approximation because we have no
 supposition on the function and on the search space.

The second experiment was performed to evaluate and compare two different cases,
 the case of single-runs, and the case of \CP\-runs. Both cases are analyzed and
 compared.

The third experiment was conducted to verify and 
  analyze behavior variability for random choice, simple sort and
  a version of the knapsack problem. Particularly the latter version presented better results.

%Indicating the inliner being evaluated in every performance measure
%would unacceptably complicate the notation.  When necessary, the
%identity of the inliner used to generate a result is made clear by
%context.

This study evaluates 14 instantiations of \FDI\ reward functions,
along with the \funcname{Benefit} static-reward function,
single-profile \FDO, and the default \llvm\ inliner.  The version of a
program transformed by an \FDI\ inliner is referred to simply by
the \FDI\ reward function used by the inliner.  For example,
the \funcname{mean} version of a program refers to the version of the
program created by applying the \FDI\ inliner using
the \funcname{mean} reward function.  This section outlines how
performance is measured and evaluated for these inliners, the
concrete \FDI\ reward functions evaluated, and the set of programs and
inputs used for the evaluation.


\subsection{Measuring Single-Run Performance}

Before an inliner's workload performance can be calculated, execution
times for runs on individual inputs must be determined.  In this
study, the execution time of one version of a program on a particular
input is computed as the 
%arithmetic average 
minimum of the execution times of three runs for this pairing.
Henceforth, any reference to execution time assumes this three-run
measurement.

The baseline execution time used in this study is obtained using
the \funcname{Never} static \FDI\ reward function, which
unconditionally returns -1.0.  Thus, \funcname{Never} will never
inline any call site where $\CostOf{\CS} > -2$.  Given the definition
of \CostOf{\CS} from \ref{inlining:scoring}, inlining is thus
limited to callees containing a single basic block, that are expected
to increase code size in the caller by no more than 3 instructions.

The performance of a particular version of a program on one input is
computed as the execution time of that version of the program on that
input, normalized by the execution time of \funcname{Never} on that
input.

\REM{
The default \llvm\ inliner is used as a baseline against which each of
the alternative inliners are compared.  The baseline execution time
for input $i$ is \blt{i}.  Both traditional \FDO\ and \FDI\ use a
training set, $u$.  The execution time on input $i$ for a the version
of the program trained on $u$ is $t_u(i)$.  Finally, workload
performance is calculated from normalized execution times: $\ntau_u(i)
= \frac{t_u(i)}{\blt{i}}$.
}

\subsection{Static Inlining}

Two static inliners are evaluated: \funcname{Benefit}, and the
default \llvm\ inliner, \funcname{Static}.  These inliners are
evaluated by simply taking the geometric mean of their normalized
execution times over all inputs in \Wfull.

\subsection{Single-Profile \FDO}

Traditional single-profile \FDO\ is referred to in this evaluation as
the \funcname{Single} inliner.  Inlining is performed using a simple
point reward function with \FDI, informed by a single profile.  Since
the profile contains a single value for each monitor, all simple point
reward functions are equal.  The evaluation uses the leave-one-in
methodology in a manner similar to that
of \cite{BerubePhD}. Each input $u \in \Wfull$ is used for
training, and then evaluated using $\Wfull/\{u\}$.  A performance score
for $u$ is calculated by taking the geometric mean of those normalized
times:
$$ \vect{gm}[u] = \sqrt[|\Wfull-1|]{ \prod_{i\in \Wfull/\{u\}} \ntau_u(i) } $$ 
The final performance for traditional \FDO\ is computed as the
geometric mean of each $\vect{\mu_g}[u]$:
$$ \mu_g = \sqrt[|\Wfull|]{ \prod_{u\in \Wfull} \vect{gm}[u] } $$


\subsection{\FDI\ Reward Functions}
\label{eval:rewards}

The goal of \CP-informed inlining is to increase performance across
all future program executions by considering the cross-input behavior
variations captured in the \CProf.  In particular, it strives to
minimize the worst-case negative impact on the performance of any
run. Performance degradation (versus an alternative inlining
algorithm) is caused by a failure to inline call sites that are
important in one or more program runs.  These missed candidates are
opportunity costs: the potential benefit of inlining is not realized
in those runs because the inlining budget was spent on other
candidates.  It is the responsibility of the reward function to
prioritize inlining candidates such that this opportunity cost is
minimized.  However, opportunity cost cuts both ways.  Inlining a call
that is important in only a small proportion of program runs may
consequently expend the code-growth budget before more
generally-beneficial inlining opportunities can be exploited.

%\begin{quote}
%{\bf Hypothesis:} Most inlining candidates that significantly
%  impact the performance of any run also improve performance across
%  most of the workload.  The number of inlining opportunities that are
%  particularly important for only a minority of runs is a small
%  fraction of the total number of inlining opportunities that
%  significantly impact program performance.
%\end{quote}
%
%This hypothesis guides the selection of the example reward functions
%evaluated in this study.  If correct, this hypothesis implies that
%scoring inlining candidates by their maximum expected benefit will
%identify both those few candidates that have a large impact in rare
%cases, as well as the candidates that provide benefit for the majority
%of the workload.  Furthermore, weighting candidates by coverage is not
%necessary because the number of rarely-beneficial candidates is
%small.  While inlining these candidates will consume some of the
%inlining budget, this merely prevents inlining a small number of calls
%that are expected to be the least beneficial among the set of calls
%that would otherwise be inlined.
%
%
%SIMPLEMETRICS = mean max
%POINTMETRICS = QPointQ=25 \
%	QPointQ=50 \
%	QPointQ=75 \
%	QPLinearQ=50,75 \
%	QPLinearQ=5,95 \
%	QPSqrtQ=50,75 \
%	QPSqrtQ=5,95
%RANGEMETRICS = QRangeQ=50,100 \
%	QRangeQ=25,75 \
%	QRangeQ=5,95 \
%	QRLinearQ=0,25,75,100 \
%	QRSqrtQ=0,25,75,100

\begin{table}
  \centering
  \begin{tiny}
  \input{Tables/qpoints}
  \end{tiny}
  \caption{Concrete quantile-based reward functions}
  \label{tab:qpoints}
\end{table}

As explained in \cite{BerubePhD}, \FDI\ provides three classes
of reward functions that use combined profiles.  Results are presented
for the mean and max {\it simple point} rewards, seven concrete
instantiations of {\it quantile point} rewards, and five concrete
instantiations of {\it quantile range} rewards.  The quantile points
selected for these example reward functions are listed
in \refTable{tab:qpoints}.  The type column, {\bf T}, indicates if the
reward functions uses (P)oints or (R)anges; this aspect of the reward
function is also evident in the Quantiles column, where ranges are
listed in angled brackets.  The combination column, {\bf C}, indicates
how multiple points or ranges are combined: (L)inearly or
(N)on-(L)inearly.  Evaluation name provides the notation used to
identify each inliner in the figures in \cite{BerubePhD}.

%The first reward functions sample the quartile boundaries, and
%represent somewhat pessimistic, median and somewhat optimistic
%rewards.  The remaining point metrics represent cases that combine a
%high and a low frequency estimate.  Non-linear combination (sqrt) uses
%a multi-objective approach that balances the importance of the two
%measurements.  The first two range-based rewards consider the weighted
%average frequency over half of the histogram's weight.  The third
%range-based reward uses the weighted average over the whole histogram,
%excluding any high or low outliers.  The final two reward metrics
%combine the weighted average frequency from the top and bottom
%quartiles, balancing optimistic and pessimistic frequency predictions.

The inliners using these reward functions are evaluated according to
the 3-fold cross-validation methodology proposed in
\cite{BerubePhD}.  The testing and training sets are
determined once from a random ordering of the inputs in \Wfull.
Identical testing and training sets are used in each fold by each
inliner.  Each input in \Wfull\ is in exactly one testing set, thus the
workload performance of an inliner is determined by taking the usual
geometric mean:
$$ \mu_g = \sqrt[|\Wfull|]{ \prod_{i \in \Wfull} \ntau(i) } $$

\subsection{Programs and Inputs}


This study evaluates the inliners described above using four
programs: \bzip, \gzip, \gcc, and \gobmk.  Each program is evaluated
using a 15-input workload, as suggested
in \cite{BerubePhD}.  \Gcc\ and \gobmk\ are taken from the
SPEC CPU 2006 benchmark suite.  SPEC provides 11 inputs for \gcc.  In
spite of the challenges involved in creating new inputs for this
benchmark, four\footnote{of seven attempts} of the SPEC 2000 benchmark
programs were converted to the single pre-processed file format.  The
converted programs are \bzip, \lbm, \mcf, and \parser.  For \gobmk,
SPEC provides 20 inputs.  However, only 5 of these inputs come from
the {\tt ref} workload; the {\tt train} workload contains 8 inputs,
and the {\tt test} workload contains 7 inputs.  Many of the inputs from
{\tt test} and {\tt train} have very short execution times: 4 inputs
take less than 1 second, 6 take 2--9 seconds, 4 take 12--19 seconds,
and 1 takes longer than 1 minute.  Execution times of less than a few
seconds are subject to large proportional timing imprecision, because
the Linux {\tt time} command reports times with a resolution of
1/100$^{th}$ of a second.  Therefore, the 15 longest-running inputs
are chosen for \Wfull.  This set is composed of the {\tt ref} and {\tt
train} SPEC workloads, plus \iname{connect} and \iname{dniwog} from
{\tt test}.  The shortest baseline running time in \Wfull\ is 2.3
seconds, for \iname{connect}.

The other two programs used in the case study are \bzip\ and \gzip.
However, rather than using the SPEC benchmark versions of these
programs, the fully-functional ``real'' versions are used.  Using the
real versions of the compressor programs eliminates the
unrealistically-simplified profiling situation where
mutually-exclusive use cases are combined into a single program run.
Consequently, these programs cannot do decompression and compression,
or multiple levels of compression, within the same run.  These
distinct use-cases must be covered by different inputs in the program
workload.  Both \bzip\ and \gzip\ share the same workload of inputs.
This workload is split in half into a compression set and a
decompression set.  Several inputs in the compression set have an
analogue in the decompression set.  However, the file format is
usually different, and the source of the data is never the same.  For
instance, \iname{revelation\mbox{-}ogg} in the compression set
and \iname{sherlock\mbox{-}mp3} in the decompression set are both audio
books, but the audio is recorded in different formats, and the books
themselves are different.

Both compressors use a numeric command-line flag to control the
tradeoff between compression speed and compression quality.  The flags
take integer values between 1 (fastest, least compressed) and 9
(slowest, most compressed).  The seven inputs in the compression set
each use a different compression level, from 3 to 9.  Most inputs are
collections of files.  Each collection is archived (uncompressed) so
that the input and output of each run is a single file.  In order to
minimize the impact of disk access, the output of each run is
redirected to {\tt /dev/null}.

The compression set contains the following inputs, with the
compression level shown in parentheses:
\begin{itemize}

\item {\tt avernum (-3)}: The installer for the demo version of the game
  ``Avernum: Escape from the Pit'' from Spiderweb Software.

\item {\tt cards (-4)}: A collection of greeting card layouts in the TIFF 
  (uncompressed) image format.

\item {\tt ebooks (-5)}: A collection of ebooks, with and without images,
  and in a variety of formats, from Project
  Gutenberg\footnote{http://www.gutenberg.org}.

\item {\tt potemkin-mp4 (-6)}: The 1925 movie ``Bronenosets Potyomkin
  (Battleship Potemkin)'' in MP4 format, from the Internet
  Archive\footnote{http://archive.org/details/BattleshipPotemkin}.

\item {\tt proteins-1 (-7)}: A sample of 33 proteins from the RCSB Protein Data
  Bank database.  6 files for each protein, each stored in a different
  text-based format, provide different characteristics of the protein's
  structure\footnote{http://www.rcsb.org}.

\item {\tt revelation-ogg (-8)}: The audio book ``The Revelation of Saint
  John'' in OGG format, from Project
  Gutenberg\footnote{http://www.gutenberg.org/ebooks/22945}.

\item {\tt usrlib-so (-9)}: A collection of shared object (.so) files from {\tt
  /usr/lib/} of a 32-bit gentoo-linux machine.

\end{itemize}

The decompression set for each compressor uses the same base set of
files, pre-compressed by the appropriate compressor at the default
compression level.  The decompression set is composed of:
\begin{itemize}
\item {\tt auriel}: The ``Auriel's Retreat'' land-mass addition mod by
  lance4791 for the game ``The Elder Scrolls IV: Oblivion'' from
  Bethesda
  Softworks\footnote{http://planetelderscrolls.gamespy.com/View.php?view=\\ \hspace*{150 pt}OblivionMods.Detail\&id=5949}.

\item {\tt gcc-453}: The source-code archive of the \gcc\ compiler,
  version 4.5.3\footnote{http://gcc.gnu.org/gcc-4.5}.

\item {\tt lib-a}: A collection of library files (.a) from {\tt /lib/} of a
  gentoo-linux machine.  As per the gentoo development guide, a
  library will be installed in {\tt /lib} (boot critical) or {\tt
    /usr/lib} (general applications), but not both\footnote{
    http://devmanual.gentoo.org/general-concepts/filesystem/index.html}.

\item {\tt mohicans-ogv}: The 1920 movie ``Last of the Mohicans'' in OGV
  (ogg video) format, from the Internet
  Archive\footnote{http://archive.org/details/last\_of\_the\_mohicans\_1920}.

\item {\tt ocal-019}: The Open Clip Art Library archive, version 0.19.  The
  images are primarily in vector-graphics
  formats\footnote{http://openclipart.org/collections}.

\item {\tt paintings-jpg}: A collection of watercolor paintings, in JPG format.

\item {\tt proteins-2}: A completely different sample of 157 proteins from
  the RCSB Protein Data Bank database, each in 6 different file
  formats.

\item {\tt sherlock-mp3}: The audio book ``The Adventures of Sherlock
  Holmes'' in MP3 format, from Project
  Gutenberg\footnote{http://www.gutenberg.org/ebooks/28733}.

\end{itemize}
